    Large-scale Hierarchical Alignment for Data-driven Text Rewriting

    We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own. Comment: RANLP 201
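As a rough illustration of what "hierarchical search over pre-trained embeddings of documents and sentences" could look like, the following minimal sketch first matches each source document to its nearest target document, then aligns sentences only within that matched pair. The function names, the dictionary layout, the cosine similarity measure, the two threshold values, and the toy vectors are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, candidates):
    """Return (index, similarity) of the candidate most similar to query.
    Assumes candidates is non-empty."""
    sims = [cosine(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

def hierarchical_align(src_docs, tgt_docs, doc_threshold=0.5, sent_threshold=0.8):
    """Hypothetical two-level alignment.
    Each document is a dict: {"vec": [...], "sents": [(text, vec), ...]}.
    Thresholds are illustrative assumptions, not values from the paper."""
    pairs = []
    tgt_doc_vecs = [d["vec"] for d in tgt_docs]
    if not tgt_doc_vecs:
        return pairs
    for src in src_docs:
        # Level 1: find the most similar target document.
        j, doc_sim = nearest(src["vec"], tgt_doc_vecs)
        if doc_sim < doc_threshold:
            continue
        tgt_sents = tgt_docs[j]["sents"]
        tgt_vecs = [vec for _, vec in tgt_sents]
        if not tgt_vecs:
            continue
        # Level 2: align sentences only inside the matched document pair.
        for text, vec in src["sents"]:
            k, sent_sim = nearest(vec, tgt_vecs)
            if sent_sim >= sent_threshold:
                pairs.append((text, tgt_sents[k][0]))
    return pairs

# Toy two-dimensional "embeddings", invented purely for this example.
src_docs = [{"vec": [1.0, 0.0],
             "sents": [("The feline sat.", [1.0, 0.1]), ("Unrelated.", [0.0, 1.0])]}]
tgt_docs = [{"vec": [0.9, 0.1], "sents": [("The cat sat.", [1.0, 0.0])]},
            {"vec": [0.0, 1.0], "sents": [("Dogs bark.", [0.1, 1.0])]}]

pseudo_parallel = hierarchical_align(src_docs, tgt_docs)
```

Restricting the sentence-level search to matched documents keeps the number of comparisons small, which is one plausible reason to search hierarchically rather than over all sentence pairs at once.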

    Character-level Chinese-English Translation through ASCII Encoding

    Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models on the character and subword level with recurrent as well as convolutional models. Comment: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 201

    Embedding-based Scientific Literature Discovery in a Text Editor Application

    Each claim in a research paper requires all relevant prior knowledge to be discovered, assimilated, and appropriately cited. However, despite the availability of powerful search engines and sophisticated text editing software, discovering relevant papers and integrating the knowledge into a manuscript remain complex tasks associated with high cognitive load. Defining comprehensive search queries requires strong motivation from authors, irrespective of their familiarity with the research field. Moreover, switching between independent applications for literature discovery, bibliography management, reading papers, and writing text burdens authors further and interrupts their creative process. Here, we present a web application that combines text editing and literature discovery in an interactive user interface. The application is equipped with a search engine that couples Boolean keyword filtering with nearest neighbor search over text embeddings, providing a discovery experience tuned to an author's manuscript and interests. Our application aims to take a step towards more enjoyable and effortless academic writing. The demo of the application (https://SciEditorDemo2020.herokuapp.com/) and a short video tutorial (https://youtu.be/pkdVU60IcRc) are available online.
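To make the idea of coupling Boolean keyword filtering with nearest neighbor search over text embeddings more concrete, here is a minimal sketch. It assumes a filter-then-rank order, whitespace tokenization, lowercase keywords, cosine similarity, and hand-made two-dimensional vectors; the `search` function, its parameters, and the toy records are hypothetical and do not describe the application's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(papers, query_vec, must_have=(), must_not_have=(), top_k=3):
    """Hypothetical two-stage search.
    Stage 1: Boolean filter (every must_have keyword present, no must_not_have keyword present).
    Stage 2: rank the survivors by embedding similarity to the query vector."""
    def passes(paper):
        words = set(paper["text"].lower().split())
        has_all = all(k in words for k in must_have)
        has_none = not any(k in words for k in must_not_have)
        return has_all and has_none

    candidates = [p for p in papers if passes(p)]
    ranked = sorted(candidates, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["title"] for p in ranked[:top_k]]

# Toy records with invented two-dimensional "embeddings".
papers = [
    {"title": "A", "text": "neural machine translation with attention", "vec": [0.9, 0.1]},
    {"title": "B", "text": "statistical machine translation", "vec": [0.7, 0.3]},
    {"title": "C", "text": "soliton dynamics in optical fibers", "vec": [0.0, 1.0]},
]

# In this sketch the query vector stands in for an embedding of the author's manuscript text.
all_translation = search(papers, [1.0, 0.0], must_have=("translation",))
neural_only = search(papers, [1.0, 0.0], must_have=("translation",),
                     must_not_have=("statistical",))
```

Filtering before ranking means the similarity computation only runs on records that satisfy the hard keyword constraints; a real system would likely use an approximate nearest neighbor index instead of the exhaustive sort shown here.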

    Character-Level Translation with Self-attention

    We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character level and converges faster while learning more robust character-level alignments. Comment: ACL 202

    Improving efficiency of supercontinuum generation in photonic crystal fibers by direct degenerate four-wave-mixing

    We numerically study supercontinuum (SC) generation in photonic crystal fibers pumped with low-power 30-ps pulses close to the zero-dispersion wavelength of 647 nm. We show how the efficiency is significantly improved by designing the dispersion to allow widely separated spectral lines generated by degenerate four-wave-mixing (FWM) directly from the pump to broaden and merge. By proper modification of the dispersion profile, the generation of additional FWM Stokes and anti-Stokes lines results in efficient generation of an 800 nm wide SC. Simulations show that the predicted efficient SC generation is more robust and can survive fiber imperfections modelled as random fluctuations of the dispersion coefficients along the fiber length. Comment: Submitted to Journal of the Optical Society of America B on 16 September 200

    Quadratic solitons as nonlocal solitons

    We show that quadratic solitons are equivalent to solitons of a nonlocal Kerr medium. This provides new physical insight into the properties of quadratic solitons, often believed to be equivalent to solitons of an effective saturable Kerr medium. The nonlocal analogy also allows for novel analytical solutions and the prediction of novel bound states of quadratic solitons. Comment: 4 pages, 3 figures

    The genetic history of the Southern Arc: a bridge between West Asia and Europe

    By sequencing 727 ancient individuals from the Southern Arc (Anatolia and its neighbors in Southeastern Europe and West Asia) over 10,000 years, we contextualize its Chalcolithic period and Bronze Age (about 5000 to 1000 BCE), when extensive gene flow entangled it with the Eurasian steppe. Two streams of migration transmitted Caucasus and Anatolian/Levantine ancestry northward, and the Yamnaya pastoralists, formed on the steppe, then spread southward into the Balkans and across the Caucasus into Armenia, where they left numerous patrilineal descendants. Anatolia was transformed by intra–West Asian gene flow, with negligible impact of the later Yamnaya migrations. This contrasts with all other regions where Indo-European languages were spoken, suggesting that the homeland of the Indo-Anatolian language family was in West Asia, with only secondary dispersals of non-Anatolian Indo-Europeans from the steppe.

    Abstractive Document Summarization in High and Low Resource Settings

    Automatic summarization aims to reduce an input document to a compressed version that captures only its salient parts. It is a topic of growing importance in today's age of information overflow. There are two main types of automatic summarization. Extractive summarization only selects salient sentences from the input, while abstractive summarization generates a summary without explicitly re-using whole sentences, resulting in summaries that are often more fluent. State-of-the-art approaches to abstractive summarization are data-driven, relying on the availability of large collections of paired articles and summaries. The pairs are typically manually constructed, a task which is costly and time-consuming. Furthermore, when targeting a slightly different domain or summary format, a new parallel dataset is often required. This heavy reliance on parallel resources limits the potential impact of abstractive summarization systems in society. In this thesis, we consider the problem of abstractive summarization from two different perspectives: high-resource and low-resource summarization. In the first part, we compare different methods for data-driven summarization, focusing specifically on the problem of generating long, abstractive summaries, such as an abstract for a scientific journal article. We discuss the difficulties that come with abstractive generation of long summaries and propose methods for alleviating them. In the second part of this thesis, we develop low-resource methods for abstractive text rewriting, first focusing on individual sentences and then on whole summaries. Our methods do not rely on parallel data, but instead utilize raw non-parallel text collections. Overall, this work makes a step towards data-driven abstractive summarization for the generation of long summaries, without having to rely on vast amounts of parallel, manually curated data.

    Abstractive Document Summarization without Parallel Data

    Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to include in the final summary, as well as a sentence abstractor, trained on pseudo-parallel and synthetic data, that paraphrases each of the extracted sentences. We perform an extensive evaluation of our method: on the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, as well as on the novel task of automatically generating a press release from a scientific journal article, which is well suited for our system. We show promising performance on both tasks, without relying on any article-summary pairs. Comment: LREC 202